Advancing mortality rate prediction in European population clusters: integrating deep learning and multiscale analysis

Accurately predicting population mortality rates is crucial for effective retirement insurance and economic policy formulation. Recent advancements in deep learning time series forecasting (DLTSF) have led to improved mortality rate predictions compared to traditional models like Lee-Carter (LC). This study focuses on mortality rate prediction in large clusters across Europe. By utilizing PCA dimensionality reduction and statistical clustering techniques, we integrate age features from high-dimensional mortality data of multiple countries, analyzing their similarities and differences. To capture the heterogeneous characteristics, an adaptive adjustment matrix is generated, incorporating sequential variation and spatial geographical information. Additionally, a combination of graph neural networks and a transformer network with an adaptive adjustment matrix is employed to capture the spatiotemporal features between different clusters. Extensive numerical experiments using data from the Human Mortality Database validate the superiority of the proposed GT-A model over traditional LC models and other classic neural networks in terms of prediction accuracy. Consequently, the GT-A model serves as a powerful forecasting tool for global population studies and the international life insurance field.

output of the GCN layer is fed to the Encoder layer of the transformer to capture individual countries' mortality trends and correlations over time.We utilize multiple linear layers instead of Decoder layer to further cluster spatio-temporal information and features.This design may project the hidden dimension to the desired output dimension, achieving end-to-end sequence prediction.Among them, the adjustment module adaptively adjusts

Initial Graph Information Output Matrix
The Hadamard product is an operation that multiplies two matrices or vectors element by element.

GCN layers
Assume that our input mortality data is X = (X i , i = 1, 2, . . ., m) , where X i ∈ R T×d , X i denotes the mortality rates in different countries, m is the number of countries included in the multi-cluster studied , T is length of time series and d is the dimension of ages.The input of GCN network is not only mortality data, but also an adjacency matrix A ∈ R m×m used to describe the distribution characteristics of data.The adjacency matrix A is designed to be composed of two parts multiplied by A DTW ∈ R m×m and A Lng−Lat ∈ R m×m , where A DTW is the similarity between the first principal components of mortality in different countries, and calculates the Dynamic Time Warping (DTW) 37 distances between m principal component sequences respectively.Matrix A Lng−Lat is the actual distance between countries, based on the latitude and longitude of each capital.In the task of realizing the simultaneous prediction of mortality in multiple countries, adding the prior conditions of the trend change of different clusters will improve the accuracy of the model 38 .Therefore, we categorize countries into the best-performing classes, as outlined in detail in section Numerical Application, and combine all countries into adjacency matrix, so that the internal relations of the same cluster are closer.We set up an adaptive adjustment matrix A ada to automatically adjust the correlation within clusters and between different clusters.
We define a correlation vector [α 1 , α 2 , . . ., α n , β] to help measure the correlation between national mortal- ity data, where n is the optimal number of clusters, α ∈ [0, 1] under each cluster is used to reduce the distance between elements in n clusters, and β ∈ [1, ∞] is used to appropriately enlarge the distance between elements in different clusters.In Fig. 1, A ada has three cluster regions and one uncorrelated region.After performing the point-wise multiplication between the reference adjacency matrix and the adaptive adjustment matrix, it is placed into the model along with the original data.The propagation formula between the network layer and the layer of graph convolution neural network is as follows: where A = (A DTW • A Lng−Lat • A ada ) ∈ R m×m (" • " is the Hadamard product) is the adjacency matrix finally input by the model, I ∈ R m×m is the unit matrix, and D ∈ R m×m is the pairwise angle matrix of Ã , P l is the feature matrix of the lth layer, W l and b l are the weight matrix and the parametric matrix respectively.wepresent the detailed description of the algorithm used in our study.As shown in algorithm 1:

Transformer layer
The Transformer model has shown remarkable ability in modeling long-term dependencies and interactions in time series data 39 .In this article, we have focused on utilizing only the encoder component of the Transformer model 40 .We adopt a sliding window with size t and step size s to predict mortality.The encoder part of the model primarily consists of a multi-headed attention mechanism, which will be expressed as follows: where Q ∈ R T×d model is query item matrix, K ∈ R T×d model is key item matrix , V ∈ R T×d model is value item that needs to be weighted averaged.The Q, K, V matrix are obtained by multiplying each input vector of the encoder by three weight matrices are the parameters that the network needs to learn and train.The Q vector and the K vector are multiplied to obtain the attention score and determine the attention distribution.The standardized attention scores are (1) compared with each multiplied V vector, which are obtained by passing the attention scores through the softmax layer.The higher the score, the greater the multiplied value, and the more attention it receives.The following equation represents the formula for the multi-headed attention mechanism:

Numerical application data
Referring to the benchmark papers we selected 20 , we select the mortality data of 16 European countries with the satisfaction time of 1950 ≤ T L ≤ 2016 and the age of 0 ≤ d ≤ 100 from the human mortality database 41 .This is also consistent with the suggestion given by the Human Mortality Database that the data after 1950 is relatively stable because the Second World War ended in 1945.Mortality data during the world wars are not informative.
In the mortality database, there are 41 countries.Among them, 16  As shown in the Fig. 2, with the passage of time, the mortality rate of European countries in the four different geographical regions has generally shown a downward trend, while countries from different geographical regions have their own unique mortality curve characteristics.Therefore, by clustering and analyzing the mortality rate of the 16 European countries, we group them into different clusters and assign the same adaptive parameters to the countries within each cluster.This approach helps determine the distribution of adaptive parameters in the matrix A ada . (3)

Adjacency matrix
The input adjacency matrix A of GCN is introduced in this section, which is obtained by multiplying three matrices A ada , A DTW and A Lng−Lat , in which A DTW and A Lng−Lat are obtained by calculating the DTW distance and geographical distance between countries respectively.For the distribution of adaptive parameters in the adaptive adjustment matrix A ada , we get the best number of clusters by analyzing and clustering the data, and give the same adaptive parameters as the countries in the same cluster.

Dynamic time warping
To measure the similarity between sequences, the DTW algorithm is employed, which calculates the distance between two sequences as a measure of their similarity.The steps of the DTW algorithm are as follows: Firstly, calculate the distance matrix between the points of two sequences.Secondly, find a path from the upper left corner to the lower right corner of the matrix to minimize the sum of elements on the path.Finally, the distance between two time series A and B is the sum of the minimum values of all possible paths.Defining two time sequences of length n as A and B, M(i, j) = A(i) − B(j) is the distance matrix between the two sequences , where i ≥ 1 , j ≤ n .In the distance matrix, the path length from the top left corner to the bottom right corner is equal to sum of the path length of its step and the current element size, where the previous element of the element a i,j on the path should be b i,j−1 , c i−1,j , d i−1,j−1 .Then in the recursive algorithm, the minimum value L min (i, j) of the cumulative distance will be expressed as follows:

PCA analysis of age dimension in each country
Before clustering the data, we use principal component analysis (PCA) to reduce the dimension.The mortality series of each country is composed of several age groups, and each time point has multidimensional variables, so it is necessary to replace the overall variables with a few variables from the multidimensional time series, while retaining most of the information in the data.For individual country mortality data X i = (x 1i , x 2i , . . ., x di ) T ∈ R T×d where T is length of time series and d is the dimension of ages, i is the ith country.So that the mean vector is µ i = E(X i ) = (µ 1i , µ 2i , . . ., µ di ) T , and the covariance matrix ] .Considering the linear transformation of the d-dimensional vari- able X i to the variable y i = (y 1i , y 2i , . . ., y di ) T : where α j T = (α 1j , α 2j , . . ., α dj ) , j = 1, 2, . . ., d .The variable y 1 corresponds to the linear transformation of X i with the largest variance.By calculating the covariance matrix of X i , the first principal component of X i is obtained.The proportion of variance explained by the first principal component for each country will be found in the following Table 2: In PCA, the first principal component height represents the overall trend of mortality change.Therefore, we chose the first principal component for cluster analysis, taking Denmark and Finland as examples.As shown in Fig. 3, gray represents 101 standardized mortality curves that change with time, and the thick black line represents the first principal component, which clearly describes the overall trend of mortality changes.

Shape-based time series clustering with adjacency matrix construction
After measuring the distance or similarity between samples using metrics, we utilize the first principal component of the 16 countries mentioned above for DTW clustering.We choose K-means clustering method to cluster samples, which randomly selects k center points in k clusters.Then the center of each cluster is calculated by iteration, and samples are distributed according to the distance.The expressions for this algorithm are as follows: (5) www.nature.com/scientificreports/where C i is the i-th cluster.As the evaluation index of clustering, the contour coefficient is chosen, and the cal- culation formula is as follows: where j represents the other sample points within the same category as sample i.The value of b(i) needs to traverse other cluster groups to get {b 1 (i), b 2 (2), . . ., b m (i)} .In addition, we introduce the sum of squared errors (SSE): , in order to help evaluate the clustering criteria.Figure 4 displays the Silhouette Coefficient and SSE values obtained from clustering the first principal component.According to Fig. 4, we choose n = 3 as the optimal number of clusters.We plot the clustering results in Fig. 5, where France, Finland, Netherlands, Norway, Sweden, Spain, Belgium, Italy and Austria are clustered into the first category; Portugal, Britain, Switzerland, Italy and Austria belong to the second category; Denmark, Slovakia and the Czech Republic are clustered into the third category.( 6)

Results
In this chapter, we evaluate the effectiveness of our model through experiments, which are divided into two steps.The first step is to verify the performance of our model under the same large cluster forecast, and the second step is to compare our large cluster forecast results with those of other models in a single country, so as to verify that the prediction results of our model in a single country are equally effective.

Large cluster prediction results
According to the mortality data of our 16 countries, we set m = 16 and T = 16 , which indicates that the forecast time is 16 years.In order to compare the effects of our models, we choose the classic multivariate time series prediction models, namely LSTM, 1D-CNN and RNN.In our model training, we fixed the epoch to 500 for each model and used the Adam optimizer with the mean squared error (MSE) loss function.The initial learning rate is set to lr = 0.001 .To evaluate the performance of the models, we calculate the root mean squared error (RMSE) for different time periods, age groups, and overall predictions 42 .In addition to the RMSE metric, we also utilized the MAE and MAPE metrics to calculate overall performance for both dimensions.The evaluation indicators used are as follows: RMSE all , MAE all , MAPE all measure the prediction error of all countries, and the values calculated from the forecast values of each model are shown in Table 3.The results demonstrate that our model is superior to other series forecasting models in large cluster forecasting.In addition, the GT-A model with the adaptive adjustment ( 8) Figure 5. Mortality first principal component cluster analysis.www.nature.com/scientificreports/matrix shows improvement compared to the GT model, which proves the effectiveness of our proposed adaptive adjustment matrix.Figure 6 shows the RMSE all (t) values of different models from 2001 to 2016.With the passage of time, the RMSE values of all models gradually increase.Specifically, 1D-CNN, GT and GT-A show relatively stable performance, while LSTM and RNN show oscillation.As well as the latest references in the field of human mortality prediction using deep learning, as shown in the references 24 , we used the transformer model(TF) with the same parameters as a preliminary comparison.Among all the models, GT and GT-A perform best in the step size prediction task, and GT-A performs better than GT.
Figures 7 and 8 show the residuals of predicted values and true values of all countries in the six models in the time dimension and the age dimension respectively, in which red indicates overestimation and blue indicates underestimation.In Fig. 7, the red dotted line represents the linear fitting curve of scattering points and the overall trend of scattering errors.The errors of LSTM ,RNN and Transformer tend to be overestimated, while the errors of 1D-CNN, GT and GT-A are relatively smooth in overall error, and the slopes of GT and GT-A are the lowest.Figure 8 presents the heat map of the average mortality error by age in 16 countries.The prediction error is obtained by subtracting the average predicted value from the average true value and then subtracting the estimated standard deviation of each age group.
All models exhibit inhomogeneities in error oscillations at low ages, caused by the excessive transition differences between unstable mortality at low ages and stable mortality at middle and low ages.Therefore, mortality prediction in the lower age groups requires higher performance of the model.Figure 8 shows that 1D-CNN, LSTM, RNN,Transformer have large overestimated parts between low to middle and low ages.In contrast, although GT and GT-A models present oscillatory irregular regions at low ages, the overall residuals are low.In the middle age region, GT and GT-A models perform evenly without abrupt overestimation and underestimation, especially from around age 30 ≤ d ≤ 90 , which is the best performance.The higher age range exhibits areas of irregularity, similar to those observed in the lower age range.The GT-A model shows a small improvement compared to the GT model at low and high ages and overall has lower mean mortality error.The results demonstrate that our model outperforms traditional time series prediction models over a large cluster range, which indicate that additional spatial information, namely homogeneous information, improves prediction accuracy in the processing of large cluster data.Our model not only improves the performance of simultaneous prediction but also obtains better performance in single country forecasting.

Experiment of comparing the prediction results
For the prediction results of a single country, we define the following indicators to measure the prediction performance of different models: www.nature.com/scientificreports/In Fig. 9, performance metrics are calculated for all age groups and time horizons across 16 countries, and it is evident that the GT and GT-A models outperform the LC model in all countries.Specifically, in terms of RMSE, GT and GT-A outperformed the LC model by 52.4% and 59.2%, respectively.The following formula represents the percentage calculation method, where m denotes the total number of countries included in the calculation.
The index of the i-th country under model1 is denoted as RMSE i (model1) .In terms of MAE, GT and GT-A exceeded the LC model by 72.2% and 79.2%, respectively.Finally, in terms of MAPE, GT and GT-A exceeded the LC model by 206.4% and 223.6%, respectively.The countries with larger values across all metrics may be attributed to the small size of their population.For instance, the populations of the top three countries with the highest indicators, namely Denmark, Hungary, and Slovakia, are 5.911 million, 9.689 million, and 5.430 million, respectively.RMSFE (Root Mean Squared Forecast Error) can be utilized to assess the performance of our predictions, which bears similarities to RMSE (Root Mean Squared Error).Once we have obtained the RMSFE, it enables us to compute prediction intervals.These intervals represent the range within which we expect our predicted values for future observations to fall.Typically, prediction intervals are expressed as intervals that encompass the likely range of true observations.The width of the prediction interval is typically determined by the confidence level and the forecast error (RMSFE).The confidence level denotes the desired probability of the prediction interval encompassing the true observation.For instance, selecting a confidence level of 95% indicates that we aim for a 0.95 probability of the prediction interval covering the true observation as shown in Fig. 12.
In a normal distribution, around 95% of the observations lie within two standard deviations of the mean.Hence, leveraging the properties of the normal distribution, we can employ the RMSFE and confidence level to calculate the size of the prediction interval.

Discussion
In this study, we present a novel model GT-A to simultaneously predict mortality for multiple countries in large clusters, representing a major improvement in mortality prediction models for large clusters.By incorporating spatial information and similarity information between multidimensional series, the traditional task of mortality time series prediction is efficiently incorporated into spatial data for the first time.The results demonstrate that exploiting similarity information between multidimensional series to capture spatial location information in large clusters has the ability to effectively improve the accuracy of time series forecasting.In large-scale cluster experiments, the prediction performance of the GT-A model is better than that of the traditional time series prediction model.Furthermore, the addition of the adaptive adjustment matrix of clustering information improves the regional heterogeneity in large clusters, and enhances the accuracy of the model to some extent.These findings suggest that future research should be conducted on the heterogeneity existing in each dimension in the study of large clusters to further improve prediction accuracy.We also find that simultaneous prediction for large clusters will improve prediction accuracy of a single national data set (Supplementary Information).Compared with the traditional LC model with single data set prediction, GT-A achieves better performance across the board.As Li and Lee 11 said, it was possible to improve the prediction accuracy of mortality of a single population by capturing mortality trends common to several populations with higher statistical confidence.So far, there are various variants of the Transformer model.However, the mortality prediction model has not progressed synchronously, so this study will innovatively apply Transformer to mortality prediction and combine GCN network to capture geographic information.Furthermore, we also consider exploring the design of more suitable models for mortality rate prediction in future research to enhance its accuracy.For instance, combining attention mechanisms with the lc model could be a potential approach.This is because advancements in mortality rate prediction models have the potential to greatly benefit public health initiatives and decision-making processes.Moreover, the black box nature of neural networks presents uncertainty that prevents a full understanding of the mathematical and actuarial principles behind them, which is also the field of future research and development.

Figure 2 .
Figure 2. Log-mortality rates over time for each age in four countries.

Figure 3 .
Figure 3. Trends in mortality and standardization of data.

Figure 8 .Figure 9 .
Figure 8.The prediction error of all countries in the six models in the age dimension.

Table 1 .
Definitions of the variables.
European countries meet the span of 1950-2016.During data preprocessing, we take the average mortality rate of all countries at the same time and age instead of missing values.In order to train and evaluate our model, the proposed model is trained from 1950 to 2000 and the mortality from 2001 to 2016 is predicted.Figure2shows the natural logarithmic mortality trend of four selected countries: Sweden in northern Europe, Switzerland in central Europe, Britain in western Europe and Spain in southern Europe.Due to the limitation of time span, eastern European countries are not included.

Table 2 .
Proportion of four principal components in 16 countries (%).Significant values are in bold.

Table 4 .
Predictive performance of four models in different countries.a RMSE,MAE,MAPE are the overall RMSE,MAE,MAPE across all ages and time horizons in each countries.b Mean is the sample mean of the RMSEs over age groups.c Q 1 ,Q 3 are the first quartile, and third quartile of the RMSEs over age groups.d GT,GT-A are the GCN-Transformer model and the GCN-Transformer model with the addition of adaptively adjusted adjacency matrix, respectively.Significant values are in bold.